Language Richness of the Web

Authors

  • Martin Majlis
  • Zdenek Zabokrtský
Abstract

We have built a corpus containing texts in 106 languages, collected from the Internet and from Wikipedia. The W2C Web Corpus contains 54.7 GB of text and the W2C Wiki Corpus contains 8.5 GB of text. In the W2C Web Corpus, more than 100 MB of text is available for 75 languages, and at least 10 MB of text is available for 100 languages. These corpora are a unique data source for linguists, since they exceed all published works both in the size of the material collected and in the number of languages covered. This language resource can be of particular use to researchers who specialize in developing multilingual technologies. We also developed software that greatly simplifies the creation of a new text corpus for a given language from text freely available on the Internet. Special attention was given to the filtering and de-duplication components, which keep the quality of the material very high.
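The abstract does not describe how the filtering and de-duplication components work. As a rough illustration only, a minimal sketch of one common approach is shown below: drop very short paragraphs, then drop exact duplicates by hashing normalized text. The function names and the `min_chars` threshold are hypothetical and are not taken from the W2C software.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def filter_and_deduplicate(paragraphs, min_chars=20):
    """Drop short paragraphs and exact duplicates (after normalization).

    min_chars is a hypothetical threshold, not a value from the paper.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        text = normalize(paragraph)
        if len(text) < min_chars:
            continue  # filtering: too short to be useful running text
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # de-duplication: an identical paragraph was already kept
        seen.add(digest)
        kept.append(paragraph)
    return kept
```

Hashing catches only exact repeats after normalization; near-duplicate detection would need a different technique, which this sketch does not attempt.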

Similar resources

English Teachers Professional Development Needs for Web Development Skills: Meeting the Challenges of Teaching English Language in the Information Age

Utilizing the resources of the web in educational practices has made instructional processes more efficient and interesting and has made the learning process on the other hand much easier and attractive. With the web, English language teachers now have the option of engaging learners in online (web-based) instructions in addition to the use of conventional classroom instructions or alternativel...

An Executive Approach Based On the Production of Fuzzy Ontology Using the Semantic Web Rule Language Method (SWRL)

Today, the need to deal with ambiguous information in semantic web languages is increasing. Ontology is an important part of the W3C standards for the semantic web, used to define a conceptual standard vocabulary for the exchange of data between systems, the provision of reusable databases, and the facilitation of collaboration across multiple systems. However, classical ontology is not enough ...

The Impact of Computer–Assisted Language Learning (CALL) /Web-Based Instruction on Improving EFL Learners’ Pronunciation Ability

The purpose of this study was to investigate the effect of CALL/Web-based instruction on improving EFL learners’ pronunciation ability. To this end, 85 students who were enrolled in a language institute in Rasht were selected as subjects. These students were given the Oxford Placement Test in order to validate their proficiency levels. They were then divided into two groups of 30 and were...

Language development and lexical awareness of bilingual (Azeri-Persian) hard of hearing impaired children

The Relationship between Mean Length of utterance (MLU), Lexical Richness and syntactical and lexical metalinguistic Awareness in Bilingual (Turkish-Persian) normal and hearing impaired Children. Objectives: Regarding the impact of hearing loss on language development and metalinguistic skill and being language development different from metalinguistic skill in bilingual children, studying of...

Impact of Dynamic Assessment on the Writing Performance of English as Foreign Language Learners in Asynchronous Web 2.0 and Face-to-face Environments

This study sought to investigate dynamic assessment (DA) - an assessment approach that embeds intervention within the assessment process and that yields information about the learner’s responsiveness to this intervention - and the writing performance of the second language (L2) learners in Web 2.0 contexts. To this end, pre- and post-treatment writings of 45 par...

Impact of Online Setting Collaboration through Strategy-Based Instruction on EFL Learners’ Self-efficacy and Oral Skills

This study aimed to investigate the impact of web-based cooperative teaching through strategy-based instruction on EFL learners’ speaking and listening skills. Moreover, the use of cooperative teaching was hypothesized to have impact on the EFL learners’ self-efficacy. To this purpose, the study followed a mixed-methods design by implementing both qualitative and quantitative data gathering pro...

Publication date: 2012